On performance of data mining: from algorithms to management systems for data exploration

Author

  • Paolo Palmerini
Abstract

Data Mining (DM) is the science of extracting useful and non-trivial information from the huge amounts of data that can be collected in many diverse fields of science, business and engineering. Due to its relatively recent development, Data Mining still poses many challenges to the research community. New methodologies are needed to mine more interesting and specific information from the data, new frameworks are needed to harmonize more effectively all the steps of the mining process, and new solutions will have to manage the complex and heterogeneous sources of information available to analysts today. A problem that has always been claimed as one of the most important to address, but has never been solved in general terms, is the performance of DM systems. The reasons for this concern are: (i) the size and distributed nature of input data; (ii) the spatio-temporal complexity of DM algorithms; (iii) the quasi real-time constraints imposed by many applications. When it is not possible to control the performance of DM systems, the actual applicability of DM techniques is compromised. In this Thesis we focused on the performance of DM algorithms, applications and systems. We faced this problem at different levels. First we considered the algorithmic level. Taking a common DM task, namely Frequent Set Counting (FSC), as a case study, we performed an in-depth analysis of performance issues in FSC algorithms. This led us to devise a new algorithm for solving the FSC problem, which we called Direct Count and Intersect (DCI). We also proposed a more general characterization of transactional datasets that allows forecasting, within a reasonable range of confidence, important properties of the FSC process. Preliminary studies suggest that measuring the entropy of a dataset can provide useful hints about the actual complexity of the mining process on such data.
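As a rough illustration of the FSC task described above, the following Python sketch counts frequent itemsets by first counting single items directly and then intersecting transaction-id sets for larger candidates. It is not the DCI algorithm from the thesis, whose details the abstract does not give; the function name `frequent_itemsets`, the `min_support` parameter and the toy data are assumptions made for illustration only.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset contained in at
    least `min_support` transactions (illustrative sketch, not DCI)."""
    # Vertical layout: map each item to the ids of the transactions containing it.
    tidlists = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidlists.setdefault(item, set()).add(tid)

    # Level 1: support of single items is the size of their tid set.
    current = {
        frozenset([item]): tids
        for item, tids in tidlists.items()
        if len(tids) >= min_support
    }

    result = {}
    while current:
        result.update({itemset: len(tids) for itemset, tids in current.items()})
        next_level = {}
        # Join pairs of frequent k-itemsets that differ in exactly one item.
        for (a, tids_a), (b, tids_b) in combinations(current.items(), 2):
            candidate = a | b
            if len(candidate) != len(a) + 1 or candidate in next_level:
                continue
            # The transactions containing the candidate are exactly those
            # containing both of its parents.
            tids = tids_a & tids_b
            if len(tids) >= min_support:
                next_level[candidate] = tids
        current = next_level
    return result


# Hypothetical toy dataset: five market-basket transactions.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
supports = frequent_itemsets(transactions, min_support=3)
```

With this toy input, every single item and every pair is frequent, while the full triple appears in only two transactions and is therefore discarded.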
Performance of DM systems is particularly important for those systems that have strong requirements in terms of response time. Web servers are a notable example of such systems: they produce huge amounts of data at different levels (access log, structure, content). If mined effectively and efficiently, the knowledge extracted from these data can be used for service personalization, system improvement or site modification. We illustrate the application of DM techniques to the web domain and propose a DM system for mining web access log data. The system is designed to be tightly coupled with the web server and to process the input stream of HTTP requests in an on-line and incremental fashion. Due to the intrinsically distributed nature of the data being mined and the commonly claimed need for high performance, parallel and distributed architectures often constitute the natural platform for data mining applications. We parallelized some DM algorithms, most notably the DCI algorithm. We designed a multilevel parallel implementation of DCI, explicitly targeted at execution on clusters of SMP nodes, that adopts multithreading for intra-node and message passing for inter-node communications. To face the problems of resource management, we also studied the architecture of a scheduler that maps DM applications onto large-scale distributed environments, like Grids. We devised a strategy to predict the execution time of a generic DM algorithm and a scheduling policy that effectively takes into account the cost of transferring data across distributed sites. The specific results obtained in studying single DM kernels or applications can be generalized to wider classes of problems, which allows the data miner to abstract from the architectural details of the application (physical interaction with the data source, base algorithm implementation) and concentrate only on the mining process.
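To make the idea of a transfer-aware scheduling policy concrete, here is a minimal Python sketch that picks the site with the lowest estimated completion time, computed as data transfer time plus predicted execution time. The abstract does not specify the thesis's cost model or predictor, so the `Site` class, its fields, the linear cost formula and all numbers below are hypothetical assumptions, not the author's method.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    speed: float           # assumed compute speed, in work units per second
    bandwidth_mb_s: dict   # assumed bandwidth (MB/s) from each source site name


def estimated_completion(site, data_site, data_mb, work_units):
    """Estimated seconds to finish the job on `site`: time to move the
    input data there (zero if it is already local) plus execution time."""
    if site.name == data_site:
        transfer = 0.0
    else:
        transfer = data_mb / site.bandwidth_mb_s[data_site]
    return transfer + work_units / site.speed


def pick_site(sites, data_site, data_mb, work_units):
    """Choose the site that minimizes the estimated completion time."""
    return min(
        sites,
        key=lambda s: estimated_completion(s, data_site, data_mb, work_units),
    )


# Hypothetical two-site setup: A holds the data, B is four times faster.
site_a = Site("A", speed=1.0, bandwidth_mb_s={"B": 10.0})
site_b = Site("B", speed=4.0, bandwidth_mb_s={"A": 10.0})

large_input = pick_site([site_a, site_b], data_site="A", data_mb=1000, work_units=100)
small_input = pick_site([site_a, site_b], data_site="A", data_mb=100, work_units=100)
```

Under these assumed numbers, the large input stays on site A because moving it would cost more than the faster processor saves, while the small input is worth shipping to site B.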
Following this generic thinking, a Data Mining Template Library (DMTL) for frequent pattern mining was designed at the Rensselaer Polytechnic Institute, Troy (NY), USA, under the supervision of Prof. M. J. Zaki. We joined the DMTL project in fall 2002. We present the main features of DMTL and a preliminary experimental evaluation.


Similar resources

S3PSO: Students’ Performance Prediction Based on Particle Swarm Optimization

Nowadays, new methods are required to take advantage of the rich and extensive gold mine of data given the vast content of data particularly created by educational systems. Data mining algorithms have been used in educational systems especially e-learning systems due to the broad usage of these systems. Providing a model to predict final student results in educational course is a reason for usi...


Identification of Fraud in Banking Data and Financial Institutions Using Classification Algorithms

In recent years, due to the expansion of financial institutions, as well as the popularity of the World Wide Web and e-commerce, a significant increase in the volume of financial transactions observed. In addition to the increase in turnover, a huge increase in the number of fraud by user’s abnormality is resulting in billions of dollars in losses over the world. T...


Personal Credit Score Prediction using Data Mining Algorithms (Case Study: Bank Customers)

Knowledge and information extraction from data is an age-old concept in scientific studies. In industrial decision-making processes, the application of this concept gives rise to data-mining opportunities. Personal credit scoring is an ever-vital tool for banking systems in order to manage and minimize the inherent risks of the financial sector, thus, the design and improvement of credit scorin...


An Optimal Model for Medicine Preparation Using Data Mining

Introduction: Lack of financial resources and liquidity are the main problems of hospitals. Pharmacies are one of the sectors that affect the turnover of hospitals and due to lack of forecast for the use and supply of medicines, at the end of the year, encounter over-inventory, large volumes of expired medicines, and sometimes shortage of medicines. Therefore, medicine prediction using availabl...


FUZZY GRAVITATIONAL SEARCH ALGORITHM AN APPROACH FOR DATA MINING

The concept of intelligently controlling the search process of gravitational search algorithm (GSA) is introduced to develop a novel data mining technique. The proposed method is called fuzzy GSA miner (FGSA-miner). At first a fuzzy controller is designed for adaptively controlling the gravitational coefficient and the number of effective objects, as two important parameters which play major ro...




Journal title:

Volume   Issue 

Pages  -

Publication date 2004